This project aims to present a different approach to data visualization. It is based on data drom the Palmer Archipelago (Antarctica) and includes information on penguin size, gender, and the islands they live on. The project will contain graphs that show a lot of information and ones that tell almost no story. A brief interpretation will also be added to the charts.
Variables: species, island and sex was changed to factor type and only complete cases was filtered.
# Setting wd and loading data
setwd("C:/Users/Kasper/OneDrive/Pulpit/Applied Statistics/R Learning/Penguin explorig")
dane <- read.csv("penguins_size.csv")
# Essential libraries
library(ggplot2)
library(tidyverse)
library(dplyr)
library(gridExtra)
library(extrafont)
# Quick data cleaning
dane <- dane %>%
mutate(species = factor(species), island = factor(island), sex = factor(sex)) %>%
filter(complete.cases(dane) & dane$sex != ".")
In the graph below it is only possible to see the number of penguins by species. In addition, it would be worth expanding this chart.
ggplot(dane, aes(x=species)) + geom_bar()
The next two charts allow you to see new regularities in the dataset. For example penguins from species Adelie live on every island, Gentoo live only on island Biscoe and Chinstrap live only on Dream island. It can also be seen that in each species the number of males and females is almost equal.
dane %>%
drop_na() %>%
count(island, species) %>%
ggplot() + geom_col(aes(x = species, y = n, fill = species)) +
geom_label(aes(x = species, y = n, label = n)) +
scale_fill_manual(values = c("darkorange","purple","cyan4")) +
facet_wrap(~island) +
theme_minimal() +
ggtitle('Penguins species number by islands') +
theme(axis.title.y = element_text(color="Grey23", size=16),
axis.text.y = element_text(size=10),
axis.title.x=element_blank(),
legend.title = element_text(size=12),
legend.text = element_text(size=8),
legend.position = "right",
legend.justification = c(0.94,0.94),
legend.background = element_rect(fill="grey88",
size=0.5, linetype="solid",
colour ="darkslateblue"),
plot.title = element_text(color="darkslateblue", size=22, family="serif"))
dane %>%
drop_na() %>%
count(sex, species) %>%
ggplot() + geom_col(aes(x = sex, y = n, fill = sex)) +
geom_label(aes(x = sex, y = n, label = n)) +
scale_fill_manual(values = c("dodgerblue2","coral2")) +
facet_wrap(~species) +
theme_minimal() +
ggtitle('Penguins female & male number by species') +
theme(axis.title.y = element_text(color="Grey23", size=16),
axis.text.y = element_text(size=10),
axis.title.x=element_blank(),
legend.title = element_text(size=12),
legend.text = element_text(size=8),
legend.position = "right",
legend.justification = c(0.94,0.94),
legend.background = element_rect(fill="grey88",
size=0.5, linetype="solid",
colour ="darkslateblue"),
plot.title = element_text(color="darkslateblue", size=22, family="serif"))
The charts below show only the distribution of the variables: flipper_lenth_mm, culmen_depth_mm and flipper_length_mm without the division into qualitative variables.
bad1 <- ggplot(dane, aes(x = culmen_length_mm)) + geom_histogram(binwidth = 2)
bad2 <- ggplot(dane, aes(x = culmen_depth_mm)) + geom_histogram(binwidth = 1)
bad3 <- ggplot(dane, aes(x = flipper_length_mm)) + geom_histogram(binwidth = 5)
grid.arrange(bad1,bad2,bad3)
On the other hand, in the histograms below, the distribution of variables has been presented in a more elegant way and the difference between species broken down into islands in which they live has been shown. Also, we can see distribution of flipper_length_mm broken down into sex and species.
a <- ggplot(dane, aes(x = culmen_length_mm, fill=species)) + geom_histogram(binwidth = 2, alpha = 0.7) + facet_grid(~island) + theme_minimal() +
scale_fill_manual(values = c("darkorange","purple","cyan4")) +
ggtitle('Distribution of variables') +
theme(axis.title.y = element_text(color="Grey23", size=11),
axis.text.y = element_text(size=9),
axis.title.x = element_text(color="Grey23", size=11),
axis.text.x = element_text(size=9),
legend.title = element_text(size=10.5),
legend.text = element_text(size=8.5),
legend.position = "right",
legend.justification = c(0.94,0.94),
legend.background = element_rect(fill="grey88",
size=0.5, linetype="solid",
colour ="darkslateblue"),
plot.title = element_text(color="darkslateblue", size=22, family="serif"))
b <- ggplot(dane, aes(x = culmen_depth_mm, fill=species)) + geom_histogram(binwidth = 1, alpha = 0.7) + facet_grid(~island) + theme_minimal() +
scale_fill_manual(values = c("darkorange","purple","cyan4")) +
theme(axis.title.y = element_text(color="Grey23", size=11),
axis.text.y = element_text(size=9),
axis.title.x = element_text(color="Grey23", size=11),
axis.text.x = element_text(size=9),
legend.title = element_text(size=10.5),
legend.text = element_text(size=8.5),
legend.position = "right",
legend.justification = c(0.94,0.94),
legend.background = element_rect(fill="grey88",
size=0.5, linetype="solid",
colour ="darkslateblue"))
c <- ggplot(dane, aes(x = flipper_length_mm, fill=species)) + geom_histogram(binwidth = 5, alpha = 0.7) + facet_grid(~island) + theme_minimal() +
scale_fill_manual(values = c("darkorange","purple","cyan4")) +
theme(axis.title.y = element_text(color="Grey23", size=11),
axis.text.y = element_text(size=9),
axis.title.x = element_text(color="Grey23", size=11),
axis.text.x = element_text(size=9),
legend.title = element_text(size=10.5),
legend.text = element_text(size=8.5),
legend.position = "right",
legend.justification = c(0.94,0.94),
legend.background = element_rect(fill="grey88",
size=0.5, linetype="solid",
colour ="darkslateblue"))
grid.arrange(a,b,c)
ggplot(dane, aes(x = flipper_length_mm, fill=sex)) + geom_histogram(binwidth = 5, alpha = 0.7) + facet_grid(~species, scales="free") + theme_minimal() +
scale_fill_manual(values = c("dodgerblue2","coral2")) +
ggtitle('Distribution of flipper length by sex and species') +
theme(axis.title.y = element_text(color="Grey23", size=16),
axis.text.y = element_text(size=10),
axis.title.x = element_text(color="Grey23", size=16),
axis.text.x = element_text(size=10),
legend.title = element_text(size=12),
legend.text = element_text(size=8),
legend.position = "right",
legend.justification = c(0.94,0.94),
legend.background = element_rect(fill="grey88",
size=0.5, linetype="solid",
colour ="darkslateblue"),
plot.title = element_text(color="darkslateblue", size=22, family="serif"))
This box plot provides some information but could be better represented graphically
ggplot(dane, aes(x = species, y = body_mass_g)) + geom_boxplot()
On the other hand, the box plots below provide more information as it is possible to see the distribution of variables broken down into characteristics such as species and gender.
ggplot(dane, aes(x = species, y = body_mass_g/1000, fill=species)) + geom_jitter() + geom_boxplot(size=1.2, alpha=0.5) + facet_grid(~sex) +
scale_fill_manual(values = c("darkorange","purple","cyan4")) + theme_bw() + ylab("Body weight in kilograms") +
ggtitle("Penguins bodyweight by species and sex") +
theme(axis.title.y = element_text(color="Grey23", size=16),
axis.text.y = element_text(size=10),
axis.title.x=element_blank(),
strip.background = element_rect(fill='lightskyblue1'),
strip.text = element_text(family="serif", face="bold"),
legend.title = element_text(size=12),
legend.text = element_text(size=8),
legend.position = "right",
legend.justification = c(0.94,0.94),
legend.background = element_rect(fill="grey88",
size=0.5, linetype="solid",
colour ="darkslateblue"),
plot.title = element_text(color="darkslateblue", size=22, family="serif"))
ggplot(dane, aes(x = sex, y = culmen_length_mm, fill=sex)) + geom_jitter() + geom_boxplot(size=1.2, alpha=0.5) + facet_grid(~species) +
scale_fill_manual(values = c("dodgerblue2","coral2")) + theme_bw() + ylab("Culmen length in mm") +
ggtitle("Culmen length by species and sex") +
theme(axis.title.y = element_text(color="Grey23", size=16),
axis.text.y = element_text(size=10),
axis.title.x=element_blank(),
strip.background = element_rect(fill='lightskyblue1'),
strip.text = element_text(family="serif", face="bold"),
legend.title = element_text(size=12),
legend.text = element_text(size=8),
legend.position = "right",
legend.justification = c(0.94,0.94),
legend.background = element_rect(fill="grey88",
size=0.5, linetype="solid",
colour ="darkslateblue"),
plot.title = element_text(color="darkslateblue", size=22, family="serif"))
In this scatterplot you only see the black points scattered around. Some clusters can be seen, but to better visualize them, the graph should be supplemented with some qualitative features
ggplot(dane, aes(x = culmen_length_mm, y = culmen_depth_mm)) + geom_point()
After distinguishing the qualitative features, it is clearly visible in the charts below some relationships between the quantitative variables and the qualitative features
ggplot(dane, aes(x = culmen_length_mm, y = culmen_depth_mm, color=species)) + geom_point(aes(shape=species, size=sex), alpha = 0.8) +
scale_shape_manual(values=c(18, 16, 17)) +
scale_color_manual(values = c("darkorange","purple","cyan4")) +
theme_minimal() +
ggtitle("Relationship between culmen length and depth") +
theme(axis.title.y = element_text(color="Grey23", size=15),
axis.text.y = element_text(size=12),
axis.title.x = element_text(color="Grey23", size=15),
axis.text.x = element_text(size=12),
legend.title = element_text(size=12),
legend.text = element_text(size=10),
legend.position = "right",
legend.justification = c(0.94,0.94),
legend.background = element_rect(fill="grey88",
size=0.5, linetype="solid",
colour ="darkslateblue"),
plot.title = element_text(color="darkslateblue", size=25, family="serif"))
ggplot(dane, aes(x = flipper_length_mm, y = body_mass_g/1000, shape=species, color=species)) + geom_point() + geom_smooth(method=lm) +
theme_minimal() + facet_grid(~sex) +
ggtitle("Relationship between body weight and flipper length") +
scale_color_manual(values = c("darkorange","purple","cyan4")) +
theme(axis.title.y = element_text(color="Grey23", size=15),
axis.text.y = element_text(size=12),
axis.title.x = element_text(color="Grey23", size=15),
axis.text.x = element_text(size=12),
legend.title = element_text(size=12),
legend.text = element_text(size=10),
legend.position = "right",
legend.justification = c(0.94,0.94),
legend.background = element_rect(fill="grey88",
size=0.5, linetype="solid",
colour ="darkslateblue"),
plot.title = element_text(color="darkslateblue", size=25, family="serif"))
Data storytelling is an important aspect of data analysis. The aim of the above project was to show how to improve the extraction of information using properly made visualizations. Adding colors, highlighting variants of quality features and adjusting the appropriate scale allows you to get to know the dataset much better.